Deep Learning Course Project - Gesture Recognition

Problem Statement

Imagine you are working as a data scientist at a home electronics company that manufactures state-of-the-art smart televisions. You want to develop a cool feature for the smart TV that can recognise five different gestures performed by the user, which will help users control the TV without using a remote.

Each gesture corresponds to a specific command.

Each video is a sequence of 30 frames (or images).

Objectives:

In this group project, we are going to build a 3D Conv model that will be able to predict the 5 gestures correctly. Please import the following libraries to get started.

We set the random seed so that the results don't vary drastically.
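A minimal sketch of the seeding step, assuming Python's `random`, NumPy, and TensorFlow 2.x are the sources of randomness. The helper name `set_seeds` and the seed value 30 are illustrative choices, not taken from the project code. NumPy and TensorFlow are seeded only if they are installed.

```python
import random


def set_seeds(seed: int = 30) -> None:
    """Seed every random number generator available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)  # TensorFlow 2.x API (assumed version)
    except ImportError:
        pass  # TensorFlow not installed; nothing to seed


# Re-seeding with the same value replays the same random sequence.
set_seeds(30)
first = random.random()
set_seeds(30)
second = random.random()
print(first == second)
```

Seeding narrows run-to-run variation but does not remove it entirely, since some GPU operations are non-deterministic.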

In this block, you read the folder names for training and validation. You also set the batch_size here. Note that you set the batch size in such a way that you are able to use the GPU at full capacity. You keep increasing the batch size until the machine throws an error.

Plot the training/validation accuracies/losses

Generator

This is one of the most important parts of the code. The overall structure of the generator has been given. In the generator, you preprocess the images, since they come in 2 different dimensions, and create a batch of video frames. You have to experiment with img_idx, y, z and normalization to get high accuracy.

Note here that a video is represented above in the generator as (number of images, height, width, number of channels). Take this into consideration while creating the model architecture.
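The bookkeeping inside the generator can be sketched without any image library. The helper names below (`select_frame_indices`, `plan_batches`, `batch_shape`) are hypothetical, and the video count of 100 is a made-up figure. Only the 30 frames per video, the batch size of 40, and the 160x160 resolution come from this write-up. The sketch picks which frames to keep (the role of img_idx), splits the videos into batches including a smaller final batch, and reports the array shape each batch would have.

```python
from typing import List, Tuple


def select_frame_indices(total_frames: int, frames_to_sample: int) -> List[int]:
    """Pick evenly spaced frame indices, always keeping the first and last frame."""
    if not 1 <= frames_to_sample <= total_frames:
        raise ValueError("frames_to_sample must be between 1 and total_frames")
    if frames_to_sample == 1:
        return [0]
    return [(i * (total_frames - 1)) // (frames_to_sample - 1)
            for i in range(frames_to_sample)]


def plan_batches(num_videos: int, batch_size: int) -> List[range]:
    """Split video positions into full batches plus one smaller remainder batch."""
    return [range(start, min(start + batch_size, num_videos))
            for start in range(0, num_videos, batch_size)]


def batch_shape(batch_len: int, frame_idx: List[int],
                height: int, width: int, channels: int = 3) -> Tuple[int, ...]:
    """Shape of one batch: (videos, frames, height, width, channels)."""
    return (batch_len, len(frame_idx), height, width, channels)


img_idx = select_frame_indices(total_frames=30, frames_to_sample=16)
batches = plan_batches(num_videos=100, batch_size=40)  # 100 is a made-up count
for batch in batches:
    print(batch_shape(len(batch), img_idx, height=160, width=160))
```

Handling the remainder batch explicitly matters because the number of videos is rarely an exact multiple of the batch size.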

Model

Here you make the model using different functionalities that Keras provides. Remember to use Conv3D and MaxPooling3D, not Conv2D and MaxPooling2D, for a 3D convolution model. You would want to use TimeDistributed while building a Conv2D + RNN model. Also remember that the last layer is a softmax layer. Design the network so that it gives good accuracy with the fewest parameters, so that it can fit in the memory of the webcam.
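To get a feel for how kernel size drives model size, the parameter count of a Conv3D layer can be worked out by hand: each filter has kernel_depth x kernel_height x kernel_width x input_channels weights plus one bias. The sketch below applies that formula to a hypothetical stack of four layers. The filter counts are assumptions for illustration, not the project's architecture, and batch normalization and dense layers are ignored.

```python
from typing import Sequence, Tuple


def conv3d_params(kernel: Tuple[int, int, int], in_channels: int, filters: int) -> int:
    """Trainable parameters of one Conv3D layer that has a bias term."""
    depth, height, width = kernel
    return depth * height * width * in_channels * filters + filters


def conv_stack_params(kernel: Tuple[int, int, int],
                      filters_per_layer: Sequence[int],
                      in_channels: int = 3) -> int:
    """Total parameters of consecutive Conv3D layers sharing one kernel size."""
    total = 0
    for filters in filters_per_layer:
        total += conv3d_params(kernel, in_channels, filters)
        in_channels = filters  # output channels feed the next layer
    return total


FILTERS = [16, 32, 64, 128]  # hypothetical filter counts
large = conv_stack_params((3, 3, 3), FILTERS)
small = conv_stack_params((2, 2, 2), FILTERS)
print("(3,3,3) kernels:", large)
print("(2,2,2) kernels:", small)
```

A (2,2,2) kernel holds 8 weights per channel pair against 27 for (3,3,3), so the convolutional part shrinks by roughly a factor of three. The dense layers after flattening usually hold most of the parameters, and their size depends on the image resolution and the number of frames.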

Sample Model

Now that you have written the model, the next step is to compile the model. When you print the summary of the model, you'll see the total number of parameters you have to train.

Let us create the train_generator and the val_generator which will be used in .fit_generator.

The steps_per_epoch and validation_steps are used by fit_generator to decide the number of next() calls it needs to make.
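Both values are a ceiling division of the number of sequences by the batch size, so that the final partial batch still counts as one step. A minimal sketch follows; the function name `steps_for` and the sequence counts are placeholders, and only the batch size of 40 comes from the experiments in this write-up.

```python
def steps_for(num_sequences: int, batch_size: int) -> int:
    """Number of generator calls needed to see every sequence once."""
    if num_sequences < 0 or batch_size <= 0:
        raise ValueError("num_sequences must be >= 0 and batch_size must be > 0")
    # Integer ceiling division: a leftover partial batch adds one more step.
    return (num_sequences + batch_size - 1) // batch_size


batch_size = 40
num_train_sequences = 663  # placeholder count
num_val_sequences = 100    # placeholder count

steps_per_epoch = steps_for(num_train_sequences, batch_size)
validation_steps = steps_for(num_val_sequences, batch_size)
print(steps_per_epoch, validation_steps)
```

This only gives one full pass per epoch if the generator actually yields the smaller remainder batch.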

Let us now fit the model. This will start training the model and with the help of the checkpoints, you'll be able to save the model at the end of each epoch.

Sample Cropping

Below are the experiments to see how training time is affected by image resolution, number of images in the sequence, and batch size.

Model-1 Base Model, Batch size = 40, No. of Epochs=15

The model is overfitting, so we need to do data augmentation.

Model 2 - Augment Data , (3,3,3) filter & 160x160 image resolution

Model is not overfitting and we get a best validation accuracy of 82% and training accuracy of 91%.

Next we will try to reduce the filter size and image resolution and see if we get better results. Moreover, since we see minor oscillations in the loss, let's try lowering the learning rate to 0.0002.

Model 3 - Reduce filter size to (2,2,2) and image resolution to 120 x 120

The model has a best validation accuracy of 75% and a training accuracy of 80%. We were also able to cut the number of parameters to half that of the earlier model. Let's try adding more layers.

Model 4 - Adding more layers

With more layers we don't see much performance improvement. We get a best validation accuracy of 82%. Let's try adding dropout at the convolution layers.

Model-5 - Adding dropout at convolution layers

Adding dropout has further reduced validation accuracy, as the model is not able to learn generalizable features.

All the experimental models above have more than 1 million parameters. Let's try to reduce the model size and see the performance.

Model 6 - reducing the number of parameters

For the above low-memory-footprint model, the best validation accuracy is 76%.

Model 7 - reducing the number of parameters

For the above low-memory-footprint model, the best validation accuracy is 72%.

Model 8 - reducing the number of parameters

For the above low-memory-footprint model, the best validation accuracy is 67%.

Model 9 - CNN- LSTM Model

For the CNN-LSTM model, we get a best validation accuracy of 77%.

As we see more cases of overfitting, let's augment the data with slight rotation as well and run the same set of models again.
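The augmentation code itself is not shown in this write-up, so the following is only a sketch of how a slight rotation could be set up. It builds the 2x3 affine matrix for a rotation about the image centre, in the same layout that OpenCV's `cv2.getRotationMatrix2D` documents, using only the standard library. The function names and the 10-degree limit are assumptions.

```python
import math
import random


def rotation_matrix(center, angle_deg, scale=1.0):
    """2x3 affine matrix for a rotation by angle_deg about center."""
    cx, cy = center
    a = scale * math.cos(math.radians(angle_deg))
    b = scale * math.sin(math.radians(angle_deg))
    return [[a, b, (1 - a) * cx - b * cy],
            [-b, a, b * cx + (1 - a) * cy]]


def apply_affine(matrix, point):
    """Map one (x, y) point through a 2x3 affine matrix."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])


def random_small_rotation(width, height, max_angle_deg=10.0, rng=None):
    """Draw a small random angle and return it with its rotation matrix."""
    rng = rng if rng is not None else random.Random()
    angle = rng.uniform(-max_angle_deg, max_angle_deg)  # assumed range
    return angle, rotation_matrix((width / 2, height / 2), angle)


angle, matrix = random_small_rotation(160, 160, rng=random.Random(30))
print(round(angle, 2), apply_affine(matrix, (80.0, 80.0)))
```

One angle should be drawn per video and applied to all of its frames. Rotating each frame independently would add jitter between frames that is not part of the gesture.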

More Augmentation

Model 10 - More Augmentation - (3,3,3) Filter & 160x160 Image resolution - similar to Model 2

Model 11 - More Augmentation (2,2,2) Filter & 120x120 Image resolution - similar to Model 3

Model 12 - More Augmentation and Adding more layers - Similar to model 4

Model 13 - More Augmentation and Adding dropouts - Similar to Model 5

Model 14 - reducing network parameters - Similar to Model 6

Model 15 - reducing network parameters - Similar to model 7

Model 16 - reducing network parameters - Similar to Model 8

Model 17 - CNN LSTM with GRU - Similar to Model 9

We see that overfitting is considerably lower when we do more augmentation. However, there is not much improvement in accuracy.

Model 18 - Transfer Learning

We are not training the MobileNet weights, and we see that validation accuracy is very poor. Let's train them as well and observe whether performance improves.

Model - 19 - Transfer Learning with GRU and training all weights

The above model reaches 99% training accuracy and 96% validation accuracy.

Consolidated Final Models

Image1.jpg

Image2.jpg

After doing all the experiments, we finalized Model 9–CNN+LSTM, which performed well.

Reason:

The best weights of the CNN-LSTM model: model-00016-0.25427-0.90347-0.58490-0.77000.h5 (19.0 MB). We used these weights for model testing. Let's have a look at the performance below.
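The checkpoint file name appears to encode the epoch and the metrics at that epoch. The sketch below splits it into fields, assuming the order is epoch, training loss, training accuracy, validation loss, validation accuracy. That order is an inference from the 77% validation accuracy reported for Model 9, not something the file name states, and the function name `parse_checkpoint_name` is hypothetical.

```python
def parse_checkpoint_name(filename: str) -> dict:
    """Split a 'model-<epoch>-<4 metrics>.h5' file name into named fields."""
    stem, extension = filename.rsplit(".", 1)
    parts = stem.split("-")
    if extension != "h5" or len(parts) != 6 or parts[0] != "model":
        raise ValueError("unexpected checkpoint name: " + filename)
    # Field order is an assumption; see the note above the code.
    return {
        "epoch": int(parts[1]),
        "train_loss": float(parts[2]),
        "train_accuracy": float(parts[3]),
        "val_loss": float(parts[4]),
        "val_accuracy": float(parts[5]),
    }


info = parse_checkpoint_name("model-00016-0.25427-0.90347-0.58490-0.77000.h5")
for name, value in info.items():
    print(name, value)
```

Read this way, the checkpoint comes from epoch 16, and the gap between a training loss of about 0.25 and a validation loss of about 0.58 is consistent with the overfitting noted earlier.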

Loading model and Testing